A Derivation of D1
Denote the logit vector as $x$; we have $p_j = e^{x_j} / \sum_k e^{x_k}$
Without the zero-mean constraint, training becomes unstable. Following the training setting of [23], the classifier network is trained with SGD with a weight decay of 5e-4, an initial learning rate of 1e-1, and a mini-batch size of 100 for all methods. We use the cosine learning rate decay schedule [49] for a total of 80 epochs. We set the outer-level learning rate $\eta$ as … The MLP weighting network is trained with Adam [51] with a fixed learning rate of 1e-3 and a weight decay of 1e-4.
Figure 7: Training curve without the zero-mean constraint on CIFAR10 under 40% uniform noise.
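The cosine decay over 80 epochs from an initial rate of 1e-1 can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the common no-restart cosine form, a per-epoch update, and a final rate of zero, none of which are stated in the text. The function name `cosine_lr` and the parameter `min_lr` are hypothetical.

```python
import math

def cosine_lr(epoch, total_epochs=80, base_lr=1e-1, min_lr=0.0):
    """Learning rate at the start of `epoch` (0-indexed).

    Assumed form: min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * epoch / total_epochs)).
    min_lr = 0.0 is an assumption; the text does not give a final rate.
    """
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One value per training epoch, starting at base_lr and decaying toward min_lr.
schedule = [cosine_lr(e) for e in range(80)]
```

Under these assumptions the rate starts at 1e-1, reaches half that value at the midpoint (epoch 40), and decreases monotonically in between.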
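\[
SU(2) = R(\theta_g, \Delta\theta, \Delta\phi)
= \begin{pmatrix} t & kj \\ kj & t \end{pmatrix}
\begin{pmatrix} e^{j\phi_P} & 0 \\ 0 & e^{j\phi_W} \end{pmatrix}
\begin{pmatrix} t & kj \\ kj & t \end{pmatrix}
\begin{pmatrix} e^{j\theta_T} & 0 \\ 0 & e^{j\theta_L} \end{pmatrix}
= e^{j\theta_g}
\begin{pmatrix} \sin\frac{\Delta\phi}{2} & \cos\frac{\Delta\phi}{2} \\ \cos\frac{\Delta\phi}{2} & -\sin\frac{\Delta\phi}{2} \end{pmatrix}
\begin{pmatrix} e^{j\frac{\Delta\theta}{2}} & 0 \\ 0 & e^{-j\frac{\Delta\theta}{2}} \end{pmatrix}
\]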
A.1 Mach-Zehnder Interferometers (MZIs)
A basic coherent optical component used in this work is the MZI. One of the most general MZI structures is shown in Figure 15; it consists of two 50:50 optical directional couplers and four phase shifters $\theta_T$, $\theta_L$, $\phi_P$, and $\phi_W$. An MZI can realize arbitrary $2 \times 2$ unitary matrices SU(2).
Figure 15: 2-by-2 MZI with top (T), left (L), upper (P), and lower (W) phase shifters.
A.2 MZI-based Photonic Tensor Core Architecture
By cascading $N(N-1)/2$ MZIs into a triangular mesh (Reck-style) or a rectangular mesh (Clements-style), we can construct an arbitrary $N \times N$ unitary U(N).
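The MZI transfer matrix can be checked numerically with a short sketch. It multiplies a coupler, the (P, W) phase shifters, a second coupler, and the (T, L) phase shifters, then compares the result with a closed form. Several things here are assumptions rather than statements from the text: the 50:50 coupler coefficients $t = k = 1/\sqrt{2}$, the definitions $\Delta\theta = \theta_T - \theta_L$ and $\Delta\phi = \phi_P - \phi_W$, and the global phase $\theta_g = (\theta_T + \theta_L + \phi_P + \phi_W)/2 + \pi/2$, which follows from carrying out the product under those definitions. All function names are hypothetical.

```python
import cmath
import math

def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def coupler(t=1 / math.sqrt(2), k=1 / math.sqrt(2)):
    """Directional coupler [[t, kj], [kj, t]]; t = k = 1/sqrt(2) assumes a 50:50 split."""
    return [[t, 1j * k], [1j * k, t]]

def phase(upper, lower):
    """Pair of phase shifters as a diagonal matrix diag(e^{j*upper}, e^{j*lower})."""
    return [[cmath.exp(1j * upper), 0], [0, cmath.exp(1j * lower)]]

def mzi(theta_t, theta_l, phi_p, phi_w):
    """Cascade: coupler * diag(phi_P, phi_W) * coupler * diag(theta_T, theta_L)."""
    m = matmul2(coupler(), phase(phi_p, phi_w))
    m = matmul2(m, coupler())
    return matmul2(m, phase(theta_t, theta_l))

def mzi_closed_form(theta_t, theta_l, phi_p, phi_w):
    """Closed form under the assumed definitions of theta_g, delta_theta, delta_phi."""
    d_theta = theta_t - theta_l
    d_phi = phi_p - phi_w
    theta_g = (theta_t + theta_l + phi_p + phi_w) / 2 + math.pi / 2
    g = cmath.exp(1j * theta_g)
    s, c = math.sin(d_phi / 2), math.cos(d_phi / 2)
    rot = [[g * s, g * c], [g * c, -g * s]]
    return matmul2(rot, phase(d_theta / 2, -d_theta / 2))

def is_unitary(m, tol=1e-12):
    """Check that (conjugate transpose of m) * m is the identity within tol."""
    h = [[m[j][i].conjugate() for j in range(2)] for i in range(2)]
    p = matmul2(h, m)
    return all(abs(p[i][j] - (1 if i == j else 0)) < tol
               for i in range(2) for j in range(2))

def num_mzis(n):
    """MZI count N(N-1)/2 for an N x N triangular or rectangular mesh."""
    return n * (n - 1) // 2
```

For example, `is_unitary(mzi(0.3, -1.1, 0.7, 2.0))` holds for these arbitrary phases, and `num_mzis(4)` gives the 6 MZIs needed for a 4-by-4 mesh.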